[SPARK-5929][PYSPARK] Context addPyPackage and addPyRequirements #12398

buckhx · 2016-04-14T14:53:47Z

What changes were proposed in this pull request?

Context.addPyPackage()
Context.addPyRequirements()

Both of these methods take a package on the master and ship it to the workers when called, so that packages no longer have to be installed manually on every worker.
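To make the "ship it to the workers" idea concrete, here is a minimal sketch of the packaging half only. It is not the code in this PR. It assumes a top-level, directory-based package that has already been imported on the driver, and the helper name `zip_package` is hypothetical. It zips the package directory so the archive could be handed to the existing `SparkContext.addPyFile`, which accepts `.zip` files.

```python
import os
import sys
import tempfile
import types
import zipfile


def zip_package(package, out_dir=None):
    """Zip an imported top-level package and return the archive path.

    Hypothetical helper for illustration; not part of PySpark or this PR.
    """
    if not isinstance(package, types.ModuleType) or not hasattr(package, "__path__"):
        raise TypeError("expected an imported package (a module with __path__)")
    if "." in package.__name__:
        raise ValueError("this sketch only handles top-level packages")

    pkg_dir = os.path.abspath(list(package.__path__)[0])
    parent = os.path.dirname(pkg_dir)
    out_dir = out_dir or tempfile.mkdtemp(prefix="pypackage-")
    archive = os.path.join(out_dir, package.__name__ + ".zip")

    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(pkg_dir):
            for fname in files:
                if fname.endswith(".pyc"):
                    continue  # skip bytecode caches
                full = os.path.join(root, fname)
                # Store paths relative to the parent directory so the archive
                # contains "<package>/__init__.py" and is importable as a zip.
                zf.write(full, os.path.relpath(full, parent))
    return archive
```

With an existing SparkContext `sc`, the result could then be distributed with `sc.addPyFile(zip_package(mypkg))`, where `mypkg` stands for whatever package was imported on the driver.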

How was this patch tested?

Unit tests are written, but I do not believe they accurately reflect a distributed environment. The test_add_py_package test still passes even when it does not use addPyPackage. The addRequirementsFile method requires internet access to reach the global PyPI server, so it won't work on the current Jenkins build system.

We have had this patch running at Palantir for about a year in production.

holdenk · 2016-04-14T21:58:50Z

So looking back on the predecessor PR it seemed like @davies suggested adding support for specifying these at runtime with spark-submit & it seems like the pip install permissions issue might still be present for testing.

holdenk · 2016-04-14T21:59:31Z

I think having this supported could be very useful :)

holdenk · 2016-04-14T22:09:26Z

python/pyspark/context.py

+        import pip
+        for req in pip.req.parse_requirements(path, session=uuid.uuid1()):
+            if not req.check_if_exists():
+                pip.main(['install', req.req.__str__()])


So it seems that this can sometimes require elevated privileges, based on the issues with the previous Jenkins run. What if, at startup, we created a fixed temp directory per context and added it to our path with sys.path.insert(0, self.pipBase), and at install time did something along the lines of:
pip.main(['install', req.req.__str__(), '--target', self.pipBase]) so that we don't need write permissions to the default pip target?
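A rough sketch of the per-context target directory idea follows. The function names are hypothetical and this is not the PR's code. It shells out to `python -m pip` rather than calling `pip.main`, since newer pip releases no longer expose `pip.main` publicly; `--target` is pip's own option for installing into a chosen directory.

```python
import subprocess
import sys
import tempfile


def make_pip_base(prefix="spark-pip-"):
    """Create a private install directory and put it first on sys.path."""
    pip_base = tempfile.mkdtemp(prefix=prefix)
    sys.path.insert(0, pip_base)
    return pip_base


def pip_install_command(requirement, pip_base):
    """Build the pip command that installs one requirement into pip_base."""
    return [sys.executable, "-m", "pip", "install", requirement,
            "--target", pip_base]


def install_into(requirement, pip_base):
    """Run the install; needs network access but no write access to site-packages."""
    subprocess.check_call(pip_install_command(requirement, pip_base))
```

Because the directory is created per context, it could also be removed when the context stops, which avoids leaving packages behind on shared machines.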

holdenk · 2016-04-14T22:20:20Z

Since @andrewor14 got Jenkins to run the earlier iteration of this PR maybe you could do the same here?

buckhx · 2016-04-18T15:08:19Z

I could see the pipBase approach working.

The other testing blocker I have is that the context created in the test seems to be able to use local packages. The test package can be imported on the workers without being distributed via addPyPackage. https://github.com/apache/spark/pull/12398/files#diff-388a156f4ce454fe98d7a99a0f7f0012R1950

Is there a way to get the mocked workers to not use the master python path?

The pip part is nice, but I think the meat of this PR is in the addPyPackage.

I looked into adding a flag to the CLI, but kind of went down a rabbit hole trying to figure it out. I'd advocate adding it to the CLI, but think someone working in the spark core space could get that integrated much more efficiently than myself.

andrewor14 · 2016-05-04T00:44:51Z

ok to test

@davies

SparkQA · 2016-05-04T05:35:23Z

Test build #2967 has finished for PR 12398 at commit ce9966e.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds no public classes.

davies · 2016-05-04T05:45:08Z

python/pyspark/tests.py

+                self.assertSequenceEqual([0, 3, 6, 9], trips.collect())
+        finally:
+            shutil.rmtree(name)
+


remove the extra empty line

davies · 2016-05-04T05:53:05Z

@buckhx These APIs seem useful. Could you also add an argument for bin/spark-submit (only for the requirements file)?

buckhx · 2016-05-09T15:44:24Z

Added an addPyPackage example to the docstring and fixed that test formatting. We're looking into adding the spark-submit arg for --py-requirements.

The pip API takes a file path via pip.req.parse_requirements https://github.com/pypa/pip/blob/develop/pip/req/req_file.py#L64 but it looks like it might be possible to take a string and break it out line by line with a mock file name via pip.req.process_line & pip.req.preprocess https://github.com/pypa/pip/blob/develop/pip/req/req_file.py#L110

davies · 2016-05-09T16:44:30Z

@buckhx Even if the pip API only takes a file path, we could write the requirements to a temporary file, I think.
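The temporary-file workaround could look roughly like this. The helper name `write_requirements_file` is hypothetical. It only writes the requirement strings to disk, one per line, and returns a path that a file-based parser such as pip's could then read.

```python
import os
import tempfile


def write_requirements_file(requirements):
    """Write requirement strings to a temp file, one per line; return its path."""
    fd, path = tempfile.mkstemp(prefix="py-requirements-", suffix=".txt")
    with os.fdopen(fd, "w") as handle:
        for line in requirements:
            line = line.strip()
            if line:  # drop blank entries
                handle.write(line + "\n")
    return path
```

The caller is responsible for deleting the file (for example with `os.remove(path)`) once the requirements have been parsed.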

add py-requirements submit option

buckhx · 2016-05-11T14:36:35Z

@robert3005 added the --py-requirements option to spark-submit. I'm currently looking at passing the requirements as a list or a line-delimited string.

buckhx · 2016-05-13T19:21:30Z

changed addRequirementsFile to addPyRequirements which takes a list of requirement strings

buckhx · 2016-06-06T13:55:25Z

@davies how's this looking?

SparkQA · 2016-06-06T18:40:42Z

Test build #3068 has finished for PR 12398 at commit 9c37e06.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

holdenk · 2016-09-07T21:17:37Z

@buckhx any interest in updating this against master?

holdenk · 2016-10-07T19:51:47Z

@buckhx Just following up to see if this is something you are still interested in working on?

HyukjinKwon · 2017-02-09T13:16:01Z

ping @buckhx

buck heroux and others added 19 commits February 18, 2015 12:50

added requirements file to pyspark

0ed060d

tarfile has no contextmanager in python2.

6b8bcde

reqs fix

2773483

temp tar file

0371ad9

bubbled up try finally

f2a46e5

forgot to remove

fca4be6

added requirementsFile tests and switch to __import__

d287522

merged tests

76ff637

pep8 styling

565bf7f

support namespace packages and extract addModule logic

23771fd

tmp_dir to mod

cd21c5c

remove reqs from context constructor

39f26d9

upstream merge

49a4ed0

upstream merge

1501d0f

Merge remote-tracking branch 'upstream/master'

3af35bb

add_py_package test

88a1d6c

uncommented pip_requirements test

93b9e9f

removed todo

82476a6

spacing

ce9966e

holdenk reviewed Apr 14, 2016

davies reviewed May 4, 2016

formatting and addPyPackage example

82534d0

Robert Kruszewski and others added 2 commits May 11, 2016 00:34

add py-requirements submit option

ea6b89f

Merge pull request #1 from robert3005/fork/master

1d5d25f

add py-requirements submit option

addRequirementsFile -> addPyRequirements

f4af842

buckhx changed the title ~~[SPARK-5929][PYSPARK] Context addPyPackage and addRequirementsFile~~ [SPARK-5929][PYSPARK] Context addPyPackage and addPyRequirements May 13, 2016

pep8ing

9c37e06

HyukjinKwon mentioned this pull request Feb 15, 2017

[BUILD] Close stale PRs #16937

Closed

asfgit closed this in ed338f7 Feb 17, 2017
